ABSTRACT

Summary: We present ProbMetab, an R package that promotes substantial
improvement in automatic probabilistic liquid chromatography-mass
spectrometry-based metabolome annotation. The inference engine core is
based on a Bayesian model implemented to (i) allow diverse sources of
experimental data and metadata to be systematically incorporated into
the model with alternative ways to calculate the likelihood function and
(ii) allow sensitive selection of biologically meaningful biochemical
reaction databases as a Dirichlet-categorical prior distribution.
Additionally, to ensure result interpretation by systems biologists, we
display the annotation in a network where observed mass peaks are
connected if their candidate metabolites are substrate/product of known
biochemical reactions. This graph can be overlaid with other graph-based
analyses, such as partial correlation networks, in a visualization
scheme exported to Cytoscape, with web and stand-alone versions.
Availability and implementation: ProbMetab was implemented in a 
modular manner to fit together with established upstream (xcms, 
CAMERA, AStream, mzMatch.R, etc) and downstream R package 
tools (GeneNet, RCytoscape, DiffCorr, etc). ProbMetab, along with 
extensive documentation and case studies, is freely available under 
GNU license at: http://labpib.fmrp.usp.br/methods/probmetab/. 
Contact: rvencio@usp.br 

Supplementary information: Supplementary data are available at
Bioinformatics online.

1 INTRODUCTION

Metabolomics is an emerging field of study in post-genomics, which aims
at comprehensive analysis of small organic molecules in biological
systems. Techniques of mass spectrometry coupled to liquid
chromatography [liquid chromatography-mass spectrometry (LC-MS)] stand
out as dominant methods in metabolomic experiments.



Although computational strategies have been used to filter and annotate
mass peaks in LC-MS experiments (Dunn et al., 2012), these methods do
not incorporate external information into a mathematical model in a
principled way. Recently, Rogers et al. (2009) put forward a
proof-of-concept in which information incorporated into a probabilistic
model provides better annotation (Breitling et al., 2013). Their
Bayesian model, by means of appropriate prior distribution selection,
introduces the elegant idea of using a set of known chemical reactions
among candidate compounds to improve annotation, as certain
combinations, detected together, would make more biochemical sense than
others.

The state of the art in probabilistic annotation established by Rogers
et al. (2009) did not include an integrative computational
implementation, a practical connection to public biological databases
such as KEGG or MetaCyc (Altman et al., 2013) or a network-based output
visualization schema. Therefore, our contribution is to fulfill these
specific needs, allowing easy access to this powerful statistical model
for the entire metabolomics bioinformatics community.

2 RESULTS AND CONCLUSION 

The platform chosen for implementation of these ideas was the well-known
and established R programming environment, which incorporates a wide
range of analyses including successful tools that perform the
preprocessing of spectral data required for metabolite annotation
(Supplementary Fig. S1) (Kuhl et al., 2011; Smith et al., 2006).

Following the brief suggestion of Rogers et al. (2011) on how their
previous method could be extended to incorporate additional experimental
information and metadata, we implemented modifications to the likelihood
term. Expanding the likelihood function L into multiplicative
independent terms allows one to account for additional orthogonal
(independent) information sources:
$L = L_{N} \cdot L_{rt} \cdot L_{iso}$, where the subscripts N, rt and
iso stand for the measurement noise model, the retention time error
model and the isotope profile error model, respectively. For a complete
description of the model, we refer the interested reader to the
Supplementary Material.
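
The factorization above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each of the three error models is a Gaussian centred on the value expected for the candidate compound. The function names, the standard deviations and the toy peak and candidate values are all hypothetical and are not taken from ProbMetab, whose actual models are described in the Supplementary Material. The terms are summed in log space, which is equivalent to multiplying the independent likelihoods.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution N(mean, sd^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def log_likelihood(obs, cand, sd_mass=0.005, sd_rt=10.0, sd_iso=0.05):
    """Hypothetical log L = log L_N + log L_rt + log L_iso for one peak/candidate pair.

    The Gaussian form and the default standard deviations are assumptions
    made for this sketch only.
    """
    log_l_n = gaussian_logpdf(obs["mass"], cand["mass"], sd_mass)             # mass noise term
    log_l_rt = gaussian_logpdf(obs["rt"], cand["rt"], sd_rt)                  # retention time term
    log_l_iso = gaussian_logpdf(obs["iso_ratio"], cand["iso_ratio"], sd_iso)  # isotope profile term
    return log_l_n + log_l_rt + log_l_iso

# Toy observed peak and two toy candidates (invented values).
peak = {"mass": 180.0634, "rt": 312.0, "iso_ratio": 0.066}
candidates = {
    "close_match": {"mass": 180.0634, "rt": 310.0, "iso_ratio": 0.065},
    "far_match": {"mass": 180.0700, "rt": 350.0, "iso_ratio": 0.100},
}

scores = {name: log_likelihood(peak, cand) for name, cand in candidates.items()}
best = max(scores, key=scores.get)
```

Because the terms are independent, a further information source would enter as one more additive log term without changing the existing ones.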

The main product of a probabilistic annotation is a list of compound
candidates ranked by their probabilities (Supplementary Fig. S2). To
make ProbMetab's results easy to navigate, we display tabular and
dynamic network outputs along with supporting information, which helps
practitioners ultimately decide on the most parsimonious annotations
instead of forcing them to rely simplistically on the top probability
assignment. All mass peaks are viewed as graph nodes. An edge between
two nodes is drawn if any candidate compound assigned to the outgoing
node can be metabolized to any candidate compound assigned to the
incoming node by means of a known biochemical reaction (Supplementary
Fig. S3). ProbMetab is capable of producing reaction graphs and
exporting them as standard Cytoscape input files, or broadcasting the
necessary graph data and attributes (colour, shapes, etc.) directly to
Cytoscape Desktop using RCytoscape (Shannon et al., 2013). This
information can be easily overlaid with other widely used systems
biology strategies such as correlation or partial correlation networks.
If mass spectra time series or biological replicates are available,
ProbMetab uses third-party packages integrated downstream to export
correlation or partial correlation graphs, along with their
intersection/difference with the reaction graph.
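
The edge rule and the overlay described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not ProbMetab's implementation. The peak names, compound labels, reaction list and correlation edges are invented toy data, and the helper `reaction_edges` is hypothetical.

```python
from itertools import permutations

# Toy inputs (hypothetical): candidate compounds per mass peak, and known
# reactions stored as directed (substrate, product) compound pairs.
candidates = {
    "peak1": {"A", "B"},
    "peak2": {"C"},
    "peak3": {"D"},
}
reactions = {("A", "C"), ("D", "B")}

def reaction_edges(candidates, reactions):
    """Return directed peak-to-peak edges supported by at least one known reaction."""
    edges = set()
    for src, dst in permutations(candidates, 2):
        # Draw src -> dst if ANY candidate of src can be converted
        # into ANY candidate of dst by a known reaction.
        if any((s, p) in reactions for s in candidates[src] for p in candidates[dst]):
            edges.add((src, dst))
    return edges

edges = reaction_edges(candidates, reactions)

# Hypothetical correlation edges between peaks (undirected).
correlation = {frozenset(("peak1", "peak2")), frozenset(("peak2", "peak3"))}

# Drop direction before comparing with the undirected correlation graph.
undirected = {frozenset(edge) for edge in edges}
shared = undirected & correlation          # supported by both graphs
reaction_only = undirected - correlation   # supported by reactions only
```

In this toy case the peak1/peak2 edge is supported by both graphs, whereas the peak3/peak1 edge appears only in the reaction graph.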

Alternatively, a biologist can visualize ProbMetab's results in a
simplified searchable web interface. Our package has a function that
calls an online web service, which checks and renders the broadcast
results as a web page. The visualization approach was developed taking
advantage of the cytoscape.js library (Lopes et al., 2010) and its
dependencies and can be easily integrated or embedded into any HTML5 web
application.

ProbMetab's documentation includes two detailed case studies in which
all its features are explored. Moreover, to highlight integration with
downstream and upstream third-party R packages, the data analysis
examples are carried out starting from raw data and following through
preprocessing up to ProbMetab's specific point of action. We used
publicly available data from Trypanosoma brucei, the causative agent of
sleeping sickness, and an original dataset from Saccharum officinarum
(sugarcane), an important biofuel source, to illustrate several points
in typical metabolomics analysis sections.

The T. brucei dataset, obtained from the mzMatch.R project website, was
chosen because it presents a set of metabolites identified with the aid
of internal control standard compounds, making it especially suited for
performance evaluation. With this validation dataset, we compare the
MetSamp (http://www.dcs.gla.ac.uk/inference/metsamp/) implementation
from Rogers et al. (2009) with ProbMetab's implementation and show that
the efficient R/C++ integrated function (Eddelbuettel and Francois,
2011) had a 3-fold running time improvement over the MATLAB
implementation. For both implementations, the highest-probability
candidate was the true identity in up to 60% of the metabolites.
However, instead of reporting only the highest-probability candidate
identity as proposed by Rogers et al. (2009), we show that, when the
complete ranking is exported in summarized visualizations, up to 90% of
metabolite identities are among the three highest-probability
candidates. The full or filtered ranking allows the experimenter to
associate the candidates with additional information present in this
output and attribute the correct identity.
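
The comparison between reporting only the top candidate and reporting the top three can be expressed as a top-k accuracy. The Python sketch below shows one way to compute it. The helper `top_k_accuracy` and all probabilities and identities are invented toy values for illustration. They are not the T. brucei results and do not reproduce the percentages quoted above.

```python
def top_k_accuracy(rankings, truth, k):
    """Fraction of peaks whose true compound is among the k most probable candidates."""
    hits = 0
    for peak, probs in rankings.items():
        # Order candidate names by decreasing posterior probability.
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth[peak] in ranked[:k]:
            hits += 1
    return hits / len(rankings)

# Toy posterior probabilities per peak (hypothetical values).
rankings = {
    "peak1": {"A": 0.70, "B": 0.20, "C": 0.10},
    "peak2": {"D": 0.50, "E": 0.30, "F": 0.20},
    "peak3": {"G": 0.40, "H": 0.35, "I": 0.15, "J": 0.10},
    "peak4": {"K": 0.60, "L": 0.40},
}
# Toy "validated" identities, standing in for internal-standard confirmations.
truth = {"peak1": "A", "peak2": "E", "peak3": "J", "peak4": "K"}

top1 = top_k_accuracy(rankings, truth, k=1)
top3 = top_k_accuracy(rankings, truth, k=3)
```

With these toy values `top1` is 0.5 and `top3` is 0.75, showing how exposing more of the ranking can recover identities that the single best assignment misses.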



The sugarcane dataset was chosen to exemplify differential expression of
annotated metabolites under contrasting environmental perturbations. We
successfully recovered changes in a known stress response pathway
(flavone and flavonol biosynthesis), showing the importance of a
network-centric visualization of metabolite annotation for tracking
metabolism changes. The benchmark dataset confirms, as anticipated by
Rogers et al. (2009), that a probabilistic model using orthogonal data
and metadata yields better automatic mass peak annotation. The
perturbation dataset shows that probabilistic annotation can yield
interpretations of differential network connectivity that would
otherwise be impossible.

We implemented a method to annotate compounds in a computational
framework that allows the introduction of prior knowledge and additional
spectral information. With the R package ProbMetab, we provide ways to
summarize the results of the series of analyses needed to extract
information from complex high-dimensional MS data, and help the
experimenter to track metabolism changes in the process of interest.